Add s3 url config #112

Open
githubzilla wants to merge 1 commit into main from add_s3_url_config

Conversation


@githubzilla githubzilla commented Oct 22, 2025

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive RocksDB Cloud migration guide with URL-based examples (AWS S3, MinIO, HTTPS), TxLog migration, production notes, precedence, backward-compatibility, and deprecation guidance for legacy prefix usage.
  • New Features

    • Introduced URL-based S3 configuration (rocksdb_cloud_s3_url) as the recommended option; supports s3://, http://, https:// and overrides legacy settings.
    • Startup validation and clearer logging to show effective configuration and surface invalid-URL errors.



coderabbitai bot commented Oct 22, 2025

Walkthrough

Adds a URL-based RocksDB Cloud S3 configuration (rocksdb_cloud_s3_url / s3_url), a ParseS3Url parser/validator, and logic that makes URL-derived bucket/path/endpoint override legacy bucket/prefix/object_path/endpoint; updates flags, docs, and S3 client construction to use the effective configuration.
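The parser the walkthrough describes can be sketched as follows. The names `S3UrlComponents` and `ParseS3Url` mirror the PR, but this body is an illustrative reconstruction of the documented behavior (object path required, `s3://` and `http(s)://` schemes), not the actual implementation in `rocksdb_config.cpp`:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Illustrative reconstruction of the URL parser described in the walkthrough.
struct S3UrlComponents {
    bool is_valid = false;
    std::string protocol;      // "s3", "http", or "https"
    std::string endpoint_url;  // only set for http(s) URLs
    std::string bucket_name;
    std::string object_path;
    std::string error_message;
};

S3UrlComponents ParseS3Url(const std::string &s3_url) {
    S3UrlComponents out;
    size_t protocol_end = s3_url.find("://");
    if (protocol_end == std::string::npos) {
        out.error_message = "Invalid S3 URL: missing protocol";
        return out;
    }
    out.protocol = s3_url.substr(0, protocol_end);
    std::transform(out.protocol.begin(), out.protocol.end(), out.protocol.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::string remaining = s3_url.substr(protocol_end + 3);

    if (out.protocol == "http" || out.protocol == "https") {
        // http(s)://{host}:{port}/{bucket}/{path}: peel off host:port first.
        size_t host_end = remaining.find('/');
        if (host_end == std::string::npos) {
            out.error_message = "Invalid S3 URL: missing bucket and object path";
            return out;
        }
        out.endpoint_url = out.protocol + "://" + remaining.substr(0, host_end);
        remaining = remaining.substr(host_end + 1);
    } else if (out.protocol != "s3") {
        out.error_message = "Unsupported protocol: " + out.protocol;
        return out;
    }

    // Remaining must be {bucket_name}/{object_path}; object_path is required,
    // so s3://bucket and s3://bucket/ are both rejected.
    size_t bucket_end = remaining.find('/');
    if (bucket_end == std::string::npos || bucket_end == 0 ||
        bucket_end + 1 == remaining.size()) {
        out.error_message = "Invalid S3 URL: missing object path";
        return out;
    }
    out.bucket_name = remaining.substr(0, bucket_end);
    out.object_path = remaining.substr(bucket_end + 1);
    out.is_valid = true;
    return out;
}
```

For example, `s3://my-bucket/my-path` yields bucket `my-bucket` and path `my-path`, while `http://localhost:9000/my-bucket/data` also captures `http://localhost:9000` as the endpoint.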

Changes

Cohort / File(s): Summary

  • Documentation (eloq_data_store_service/README.md, eloq_data_store_service/RocksDB_Configuration_Flags.md):
    Adds a RocksDB Cloud Configuration Migration Guide and docs for the new URL-based config (rocksdb_cloud_s3_url), with examples, supported protocols, precedence/deprecation notes, and duplicated guidance for production deployment.
  • Configuration Header (eloq_data_store_service/rocksdb_config.h):
    Adds struct S3UrlComponents, declares ParseS3Url(const std::string&), adds std::string s3_url_ to RocksDBCloudConfig, and an IsS3UrlConfigured() accessor.
  • Configuration Implementation (eloq_data_store_service/rocksdb_config.cpp):
    Adds DEFINE_string(rocksdb_cloud_s3_url, ...), initializes s3_url_ preferring the CLI flag over legacy config, and integrates URL-handling precedence into RocksDBCloudConfig/RocksDBConfig initialization.
  • Storage Implementation (eloq_data_store_service/rocksdb_cloud_data_store.cpp):
    When s3_url_ is set, calls ParseS3Url; derives effective_bucket_name, effective_bucket_prefix (cleared), effective_object_path, and effective_endpoint_url; validates the URL (fatal on invalid); uses the effective values for S3 client creation and logging.
  • CLI / Dump Utility (eloq_data_store_service/rocksdb_cloud_dump.cpp):
    Adds an s3_url flag, parses/validates it at argument time, derives bucket/object/endpoint (clearing bucket_prefix), sets s3_endpoint_url used by the client factory, and enforces s3_url precedence over legacy flags.
  • Flag/Params Additions (eloq_data_store_service/rocksdb_cloud_dump.cpp, eloq_data_store_service/rocksdb_config.cpp):
    New public flags: s3_url (dump tool) and rocksdb_cloud_s3_url (config); updates legacy flag descriptions to indicate deprecation and that the URL takes precedence.
  • Public API Adjustments (eloq_data_store_service/rocksdb_cloud_data_store.cpp, related headers):
    Uses computed effective_* values rather than legacy fields directly; updates calls to SetBucketName to the single-argument form and uses a new SetBucketPrefix API where needed.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Flags as Flags / CLI / Config
    participant CFG as RocksDBCloudConfig
    participant Parser as ParseS3Url()
    participant Store as RocksDB Cloud Store / S3 Client

    Flags->>CFG: read rocksdb_cloud_s3_url / legacy fields
    alt s3_url provided
        CFG->>Parser: ParseS3Url(s3_url)
        Parser-->>CFG: S3UrlComponents{valid?, protocol, endpoint, bucket, path}
        CFG->>CFG: set s3_url_, derive effective_bucket, effective_path, effective_endpoint\nclear bucket_prefix
        CFG-->>Store: provide effective config (URL-based)
    else no s3_url
        CFG-->>Store: provide legacy config (bucket_name, bucket_prefix, object_path, s3_endpoint)
    end
    Store->>Store: create S3 client (effective_endpoint)\nproceed with storage operations
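The precedence in the diagram can be sketched as a small resolver. The `EffectiveConfig` type and `ResolveConfig` helper here are hypothetical; the field names mirror the walkthrough, and the key behavior shown is that URL-derived values win and the bucket prefix is cleared:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the precedence rule shown in the sequence diagram.
struct EffectiveConfig {
    std::string bucket_name;
    std::string bucket_prefix;
    std::string object_path;
    std::string endpoint_url;
};

// legacy holds values from the old flags; url_* come from a successful
// ParseS3Url. If s3_url is configured, it overrides everything and the
// bucket prefix is intentionally cleared.
EffectiveConfig ResolveConfig(const std::string &s3_url,
                              const std::string &url_bucket,
                              const std::string &url_path,
                              const std::string &url_endpoint,
                              const EffectiveConfig &legacy) {
    if (!s3_url.empty()) {
        return EffectiveConfig{url_bucket, "", url_path, url_endpoint};
    }
    return legacy;
}
```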

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to: S3 URL parsing/validation (ParseS3Url) and its edge cases.
  • Public API changes: SetBucketName signature change and added SetBucketPrefix usage.
  • Consistency where effective_* values replace legacy fields across client factory creation and logging.

Possibly related PRs

  • fix gcs compile error. #75 — Modifies related RocksDB cloud S3 code paths (BuildS3ClientFactory / conditional S3 compilation) and may overlap with endpoint/client factory changes.
  • Standalone purger support #81 — Changes handling of bucket/prefix/object path in RocksDB cloud code; likely intersects with effective bucket/prefix logic.
  • Warm up local ssd of standby node #145 — Alters RocksDB cloud S3 paths (bucket/prefix/object path/endpoint) and may conflict or interact with these URL-based overrides.

Suggested reviewers

  • liunyl
  • lzxddz

Poem

🐰 I bounced through flags and parsed a URL bright,
Buckets found, prefixes cleared in morning light.
Endpoints set, validators checked tight,
Docs updated, the rabbit hops with delight—
Configuration now leaps just right! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 22.22%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check: ❓ Inconclusive. The title 'Add s3 url config' is vague and generic; it does not convey the scope and impact of the changes. Consider a more specific title such as 'Introduce URL-based S3 configuration for RocksDB Cloud' or 'Add S3 URL-based config with legacy parameter deprecation'.
✅ Passed checks (1 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
eloq_data_store_service/rocksdb_cloud_dump.cpp (4)

176-184: Allow default AWS credential provider chain (don’t require keys).

parse_arguments throws if keys are missing, but main later supports provider chain. Relax this to improve UX.

-    if (params.aws_access_key_id.empty())
-    {
-        throw std::runtime_error("AWS access key ID is required");
-    }
-    if (params.aws_secret_key.empty())
-    {
-        throw std::runtime_error("AWS secret access key is required");
-    }
+    // Credentials may be supplied via env/instance profiles; do not hard-require here.

204-266: HTTPS verification disabled and endpoint lowercased — security and correctness.

  • verifySSL=false breaks TLS security.
  • Lowercasing the entire endpoint can corrupt case-sensitive paths.

Add a flag and preserve endpoint case; verify HTTPS by default.

+DEFINE_bool(s3_skip_ssl_verify, false, "Skip TLS certificate verification for custom S3 endpoint (not recommended)");
 ...
-    return [endpoint](const std::shared_ptr<Aws::Auth::AWSCredentialsProvider>
+    return [endpoint](const std::shared_ptr<Aws::Auth::AWSCredentialsProvider>
                           &credentialsProvider,
                       const Aws::Client::ClientConfiguration &baseConfig)
                -> std::shared_ptr<Aws::S3::S3Client>
     {
-        // Check endpoint url start with http or https
+        // Check endpoint URL scheme
         if (endpoint.empty())
         {
             return nullptr;
         }
-
-        std::string endpoint_url = endpoint;
-        std::transform(endpoint_url.begin(),
-                       endpoint_url.end(),
-                       endpoint_url.begin(),
-                       [](unsigned char c) { return std::tolower(c); });
+        const std::string endpoint_lc = [&]{
+            std::string t = endpoint;
+            std::transform(t.begin(), t.end(), t.begin(), [](unsigned char c){ return std::tolower(c); });
+            return t;
+        }();
 
         bool secured_url = false;
-        if (endpoint_url.rfind("http://", 0) == 0)
+        if (endpoint_lc.rfind("http://", 0) == 0)
         {
             secured_url = false;
         }
-        else if (endpoint_url.rfind("https://", 0) == 0)
+        else if (endpoint_lc.rfind("https://", 0) == 0)
         {
             secured_url = true;
         }
         else
         {
             LOG(ERROR) << "Invalid S3 endpoint URL";
             std::abort();
         }
 
         // Create a new configuration based on the base config
         Aws::Client::ClientConfiguration config = baseConfig;
-        config.endpointOverride = endpoint_url;
+        config.endpointOverride = endpoint; // preserve original case
         if (secured_url)
         {
             config.scheme = Aws::Http::Scheme::HTTPS;
         }
         else
         {
             config.scheme = Aws::Http::Scheme::HTTP;
         }
-        // Disable SSL verification for HTTPS
-        config.verifySSL = false;
+        // Verify TLS by default; allow opt-out for dev-only scenarios
+        config.verifySSL = secured_url && !FLAGS_s3_skip_ssl_verify;
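The fix above boils down to one rule: classify the scheme case-insensitively, but keep the original endpoint string for `endpointOverride`. A standalone sketch of that classification, independent of the AWS SDK types used in the diff:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Returns true for https, false for http, nullopt for anything else.
// Only a local copy is lowercased for scheme detection; the caller should
// pass the original, case-preserved endpoint to the client configuration.
std::optional<bool> EndpointIsSecure(const std::string &endpoint) {
    std::string lc = endpoint;
    std::transform(lc.begin(), lc.end(), lc.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lc.rfind("http://", 0) == 0) {
        return false;
    }
    if (lc.rfind("https://", 0) == 0) {
        return true;
    }
    return std::nullopt;  // unsupported scheme; caller should reject
}
```

With this split, `HTTPS://Minio.Local:9000` is still recognized as secure while the mixed-case host survives untouched in `endpointOverride`.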

476-484: Bug: status not assigned from ListColumnFamilies result.

You check a stale status. Capture the return.

-        rocksdb::DBCloud::ListColumnFamilies(
-            options, params.db_path, &column_families);
-        if (!status.ok())
+        status = rocksdb::DBCloud::ListColumnFamilies(
+            options, params.db_path, &column_families);
+        if (!status.ok())
         {
             LOG(ERROR) << "Cannot list column families: " << status.ToString();
             throw std::runtime_error("Cannot list column families: " +
                                      status.ToString());
         }

363-388: Ensure Aws::ShutdownAPI runs on exceptions.

Move Aws::SDKOptions outside try and guard Shutdown in catch.

-int main(int argc, char **argv)
+int main(int argc, char **argv)
 {
     // Initialize Google's logging library
     google::InitGoogleLogging(argv[0]);
     FLAGS_logtostderr = 1;  // Log to stderr by default
 
     // Set up gflags
     gflags::SetUsageMessage("RocksDB Cloud Dump Utility");
     gflags::SetVersionString("1.0.0");
 
-    // Parse command line flags
+    // Parse command line flags
     gflags::ParseCommandLineFlags(&argc, &argv, true);
 
+    Aws::SDKOptions aws_options;
+    bool aws_inited = false;
     try
     {
         // Parse command line arguments
         CmdLineParams params = parse_arguments();
 
         // Create directory for DB path if it doesn't exist
         std::filesystem::create_directories(params.db_path);
 
-        // Initialize AWS SDK
-        Aws::SDKOptions aws_options;
-        Aws::InitAPI(aws_options);
+        // Initialize AWS SDK
+        Aws::InitAPI(aws_options);
+        aws_inited = true;
 ...
-            Aws::ShutdownAPI(aws_options);
+            Aws::ShutdownAPI(aws_options);
             google::ShutdownGoogleLogging();
             return 0;
         }
 ...
-        Aws::ShutdownAPI(aws_options);
+        Aws::ShutdownAPI(aws_options);
 
         return 0;
     }
     catch (const std::exception &e)
     {
         LOG(ERROR) << "Error: " << e.what();
         print_usage(argv[0]);
+        if (aws_inited) Aws::ShutdownAPI(aws_options);
         google::ShutdownGoogleLogging();
         return 1;
     }
 
     // Clean up
     gflags::ShutDownCommandLineFlags();
     google::ShutdownGoogleLogging();

Also applies to: 663-669
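An alternative to threading an `aws_inited` flag through the catch block is a RAII guard, so shutdown runs automatically on any exit path. This sketch uses generic callables rather than the real `Aws::InitAPI`/`Aws::ShutdownAPI` signatures, but the same shape applies when those are bound in:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Generic scope guard: runs the init function immediately and guarantees the
// shutdown function runs when the scope exits, including via an exception.
class ScopedApi {
public:
    ScopedApi(std::function<void()> init, std::function<void()> shutdown)
        : shutdown_(std::move(shutdown)) {
        init();
    }
    ~ScopedApi() { shutdown_(); }
    ScopedApi(const ScopedApi &) = delete;
    ScopedApi &operator=(const ScopedApi &) = delete;

private:
    std::function<void()> shutdown_;
};
```

In the dump tool this would wrap the AWS SDK lifetime, e.g. `ScopedApi aws([&]{ Aws::InitAPI(opts); }, [&]{ Aws::ShutdownAPI(opts); });` declared before the work that can throw.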

eloq_data_store_service/rocksdb_cloud_data_store.cpp (3)

219-249: HTTPS verification disabled and endpoint lowercased — fix like dump tool.

Preserve endpoint case; verify HTTPS by default.

-        std::string endpoint_url = toLower(endpoint);
+        const std::string endpoint_lc = toLower(endpoint);
 
         bool secured_url = false;
-        if (endpoint_url.rfind("http://", 0) == 0)
+        if (endpoint_lc.rfind("http://", 0) == 0)
         {
             secured_url = false;
         }
-        else if (endpoint_url.rfind("https://", 0) == 0)
+        else if (endpoint_lc.rfind("https://", 0) == 0)
         {
             secured_url = true;
         }
         else
         {
             LOG(ERROR) << "Invalid S3 endpoint url";
             std::abort();
         }
 ...
-        config.endpointOverride = endpoint_url;
+        config.endpointOverride = endpoint; // preserve original case
 ...
-        // Disable SSL verification for HTTPS
-        config.verifySSL = false;
+        // Verify TLS when using HTTPS
+        config.verifySSL = secured_url;

423-431: Manifest scan uses legacy fields; ignore URL overrides.

Use the effective bucket/prefix/object_path already applied to cfs_options_.

-    bool ret = FindMaxTermFromCloudManifestFiles(storage_provider,
-                                                 cloud_config_.bucket_prefix_,
-                                                 cloud_config_.bucket_name_,
-                                                 cloud_config_.object_path_,
+    bool ret = FindMaxTermFromCloudManifestFiles(storage_provider,
+                                                 cfs_options_.src_bucket.GetBucketPrefix(),
+                                                 cfs_options_.src_bucket.GetBucketName(),
+                                                 cfs_options_.src_bucket.GetObjectPath(),
                                                  cloud_config_.branch_name_,
                                                  dss_shard_id,
                                                  cloud_manifest_prefix,
                                                  max_term);

639-646: Purger listener bucket uses legacy fields; ignore URL overrides.

Derive from the effective cfs_options passed to OpenCloudDB.

-    std::string bucket_name =
-        cloud_config_.bucket_prefix_ + cloud_config_.bucket_name_;
+    std::string bucket_name = cfs_options.src_bucket.GetBucketPrefix() +
+                              cfs_options.src_bucket.GetBucketName();
 ...
-        cloud_config_.object_path_,
+        cfs_options.src_bucket.GetObjectPath(),
🧹 Nitpick comments (11)
eloq_data_store_service/README.md (3)

145-153: Call out prefix removal and object_path requirement inline.

Add a short note under the URL example that bucket_prefix is ignored with URL config and the object_path segment is mandatory (URLs like s3://bucket only will fail validation). This prevents migration surprises.

Apply:

 rocksdb_cloud_s3_url = s3://my-eloqdata-bucket/rocksdb_cloud
+; Note: bucket_prefix is not supported with URL config. If you previously used a prefix,
+; fold it into the bucket name (e.g., s3://prod-my-eloqdata-bucket/rocksdb_cloud).
+; The object_path segment (after the bucket) is required.

155-163: MinIO example: add HTTPS and cert note.

Provide an HTTPS variant and a tip that custom CAs/self‑signed certs require proper trust configuration rather than disabling verification in code.

Apply:

 rocksdb_cloud_s3_url = http://localhost:9000/my-bucket/rocksdb_data
+; or, with TLS:
+; rocksdb_cloud_s3_url = https://minio.local:9000/my-bucket/rocksdb_data
+; Ensure the TLS certificate is trusted by the host (install CA if self-signed).

286-300: Document strict URL parsing.

Explicitly note that:

  • object_path cannot be empty (trailing slash like s3://bucket/ is invalid),
  • no leading slash before bucket (s3:///bucket/... is invalid).

Add to Important Notes:

 5. **Validation:** Invalid URLs will cause the application to fail at startup with a descriptive error message.
+6. **URL shape:** The path segment after the bucket is required (no trailing slash-only paths). Examples:
+   - Valid: `s3://bucket/data`, `http://host:9000/bucket/data`
+   - Invalid: `s3://bucket/` (missing object_path), `s3:///bucket/data` (extra slash)
eloq_data_store_service/RocksDB_Configuration_Flags.md (2)

101-106: Clarify object_path is mandatory and no trailing slash.

State explicitly that the segment after the bucket must be non-empty; s3://bucket/ is invalid and will be rejected.

Apply:

 **URL Format Specification:**
-- S3: `s3://{bucket_name}/{object_path}`
+- S3: `s3://{bucket_name}/{object_path}` (object_path is required; no trailing slash-only paths)
 - HTTP/HTTPS: `http(s)://{host}:{port}/{bucket_name}/{object_path}`

116-117: Precedence note is good. Add bucket_prefix reminder.

Reiterate here that bucket_prefix is ignored when URL is used to avoid misconfiguration.

Apply:

 **Note:** If both `rocksdb_cloud_s3_url` and legacy options are provided, the URL-based configuration takes precedence and overrides the legacy settings.
+When URL-based config is active, any `rocksdb_cloud_bucket_prefix` value is ignored.
eloq_data_store_service/rocksdb_config.h (1)

118-120: noexcept on trivial accessor.

Mark IsS3UrlConfigured as noexcept.

-    bool IsS3UrlConfigured() const { return !s3_url_.empty(); }
+    bool IsS3UrlConfigured() const noexcept { return !s3_url_.empty(); }
eloq_data_store_service/rocksdb_cloud_dump.cpp (1)

72-96: Update usage to include s3_url and deprecations.

Help users discover the new flag and legacy status.

-        "--bucket_name=BUCKET --bucket_prefix=PREFIX [options]\n"
+        "[--s3_url=s3://BUCKET/PATH | --bucket_name=BUCKET --bucket_prefix=PREFIX] [options]\n"
 ...
-        "  --bucket_name=BUCKET         S3 bucket name\n"
-        "  --bucket_prefix=PREFIX       S3 bucket prefix\n"
-        "  --object_path=PATH           S3 object path for RocksDB Cloud "
+        "  --s3_url=URL                 URL form (recommended): s3://bucket/path or http(s)://host:port/bucket/path\n"
+        "  --bucket_name=BUCKET         Legacy: S3 bucket name\n"
+        "  --bucket_prefix=PREFIX       Legacy: S3 bucket prefix (ignored with --s3_url)\n"
+        "  --object_path=PATH           Legacy: S3 object path for RocksDB Cloud "
+        "  --s3_endpoint=URL            Legacy: Custom S3 endpoint URL (use URL host in --s3_url instead)\n"
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

378-386: Transfer Manager toggle is endpoint-driven; log rationale.

Optional: add a short log when disabling transfer manager due to custom endpoint to aid ops.

     if (!effective_endpoint_url.empty())
     {
         cfs_options_.s3_client_factory =
             BuildS3ClientFactory(effective_endpoint_url);
         // Intermittent and unpredictable IOError happend from time to
         // time when using aws transfer manager with minio. Disable aws
         // transfer manager if endpoint is set (minio).
         cfs_options_.use_aws_transfer_manager = false;
+        LOG(INFO) << "Custom S3 endpoint set; disabling AWS Transfer Manager.";
     }
eloq_data_store_service/rocksdb_config.cpp (3)

296-304: Clarify URL style and consider guarding the flag.

  • Help text implies path-style addressing only; explicitly say virtual-hosted-style (e.g., https://bucket.s3.amazonaws.com/path) is not supported to avoid confusion.
  • For consistency with other cloud flags, consider wrapping this DEFINE under the same S3/GCS build guard to hide it from non-cloud builds.

Apply this doc tweak (optional):

 DEFINE_string(rocksdb_cloud_s3_url,
               "",
-              "RocksDB cloud S3 URL. Format: s3://{bucket}/{path} or "
-              "http(s)://{host}:{port}/{bucket}/{path}. "
+              "RocksDB cloud S3 URL. Supported formats (path‑style only): "
+              "s3://{bucket}/{path} or http(s)://{host}[:port]/{bucket}/{path}. "
+              "Virtual‑hosted style (e.g., https://{bucket}.s3.{region}.amazonaws.com/{path}) is not supported. "
               "Examples: s3://my-bucket/my-path, "
               "http://localhost:9000/my-bucket/my-path. "
               "This option takes precedence over legacy configuration options "
               "if both are provided");

Optionally guard the flag:

+#if (defined(DATA_STORE_TYPE_ELOQDSS_ROCKSDB_CLOUD_S3) || \
+     defined(DATA_STORE_TYPE_ELOQDSS_ROCKSDB_CLOUD_GCS))
 DEFINE_string(rocksdb_cloud_s3_url,
 ...
               "if both are provided");
+#endif

307-399: Make URL parsing more robust (scheme normalization, slash handling, minor shadowing).

Parser works for the documented formats but can be hardened:

  • Normalize scheme to lowercase to accept “S3/HTTP/HTTPS”.
  • Trim redundant leading slashes after the endpoint split; strip trailing slashes from object_path.
  • Avoid reusing the name first_slash in two scopes (small readability nit).

Apply minimally invasive changes:

 S3UrlComponents ParseS3Url(const std::string &s3_url)
 {
     S3UrlComponents result;

@@
-    result.protocol = s3_url.substr(0, protocol_end);
+    result.protocol = s3_url.substr(0, protocol_end);
+    // Normalize protocol to lowercase
+    std::transform(result.protocol.begin(),
+                   result.protocol.end(),
+                   result.protocol.begin(),
+                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
@@
-    if (result.protocol == "http" || result.protocol == "https")
+    if (result.protocol == "http" || result.protocol == "https")
     {
         // Find the first slash after the host:port
-        size_t first_slash = remaining.find('/');
-        if (first_slash == std::string::npos)
+        size_t endpoint_sep = remaining.find('/');
+        if (endpoint_sep == std::string::npos)
         {
             result.error_message =
                 "Invalid S3 URL format: missing bucket and object path";
             return result;
         }

         // Store the full endpoint URL including protocol
-        result.endpoint_url =
-            result.protocol + "://" + remaining.substr(0, first_slash);
-        remaining = remaining.substr(first_slash + 1);
+        result.endpoint_url =
+            result.protocol + "://" + remaining.substr(0, endpoint_sep);
+        remaining = remaining.substr(endpoint_sep + 1);
+        // Remove any leading slashes before bucket
+        while (!remaining.empty() && remaining.front() == '/')
+        {
+            remaining.erase(0, 1);
+        }
     }
@@
-    size_t first_slash = remaining.find('/');
-    if (first_slash == std::string::npos)
+    size_t key_sep = remaining.find('/');
+    if (key_sep == std::string::npos)
     {
         result.error_message =
             "Invalid S3 URL format: missing object path (format: "
             "{bucket_name}/{object_path})";
         return result;
     }

-    result.bucket_name = remaining.substr(0, first_slash);
-    result.object_path = remaining.substr(first_slash + 1);
+    result.bucket_name = remaining.substr(0, key_sep);
+    result.object_path = remaining.substr(key_sep + 1);
+    // Normalize object_path: strip trailing slashes
+    while (!result.object_path.empty() && result.object_path.back() == '/')
+    {
+        result.object_path.pop_back();
+    }

Note: you’ll need #include <algorithm> and #include <cctype> at the top.

Please confirm that S3UrlComponents default-initializes is_valid to false; otherwise, explicitly set result.is_valid = false on all early-return construction paths.


613-619: Precedence is read; enforce it and surface conflicts.

You read s3_url_ first (good). When both URL and legacy fields are set, ensure downstream code:

  • Ignores legacy bucket_prefix_/bucket_name_/object_path_ and logs a clear warning, or
  • Fails fast if both are present.

If not already enforced elsewhere, add a post-load check in this ctor after legacy fields are read:

if (!s3_url_.empty()) {
    if (!bucket_prefix_.empty()) {
        LOG(WARNING) << "rocksdb_cloud_bucket_prefix is ignored when rocksdb_cloud_s3_url is set.";
        bucket_prefix_.clear();
    }
    // Consider warning for bucket_name_/object_path_ too, since the URL is the source of truth.
}

Would you like a follow-up patch plus unit tests covering precedence and logging?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd88ee6 and b4b2f8b.

📒 Files selected for processing (6)
  • eloq_data_store_service/README.md (3 hunks)
  • eloq_data_store_service/RocksDB_Configuration_Flags.md (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (2 hunks)
  • eloq_data_store_service/rocksdb_cloud_dump.cpp (2 hunks)
  • eloq_data_store_service/rocksdb_config.cpp (4 hunks)
  • eloq_data_store_service/rocksdb_config.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
eloq_data_store_service/rocksdb_config.h (1)
eloq_data_store_service/rocksdb_config.cpp (2)
  • ParseS3Url (310-397)
  • ParseS3Url (310-310)
eloq_data_store_service/rocksdb_config.cpp (1)
eloq_data_store_service/main.cpp (2)
  • CheckCommandLineFlagIsDefault (90-102)
  • CheckCommandLineFlagIsDefault (90-90)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)
eloq_data_store_service/rocksdb_config.cpp (2)
  • ParseS3Url (310-397)
  • ParseS3Url (310-310)
eloq_data_store_service/rocksdb_cloud_dump.cpp (1)
eloq_data_store_service/rocksdb_config.cpp (2)
  • ParseS3Url (310-397)
  • ParseS3Url (310-310)
🔇 Additional comments (3)
eloq_data_store_service/rocksdb_config.h (1)

74-83: S3UrlComponents looks good.

Structure and defaults are sensible.

eloq_data_store_service/rocksdb_config.cpp (2)

275-275: LGTM: default-on purger flag.

Enabling the purger by default is reasonable and matches the constructor wiring. No issues here.


677-677: Overflow concern properly addressed.

The code correctly implements the fix. At lines 677–678, rocksdb_cloud_purger_periodicity_secs is declared as uint64_t, converting the uint32 flag value before the multiplication by 1000 at line 690. This avoids overflow. The target field purger_periodicity_millis_ is size_t, which is 64-bit on modern systems.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

422-430: Fix inconsistent bucket configuration usage in FindMaxTermFromCloudManifestFiles and PurgerEventListener.

Lines 423-430 and 640 use cloud_config_.bucket_prefix_, cloud_config_.bucket_name_, and cloud_config_.object_path_ directly, but these should use the effective configuration values computed at lines 301-331. When S3 URL-based configuration is active, effective_bucket_prefix is explicitly set to empty (line 324), whereas cloud_config_.bucket_prefix_ retains the legacy value. This causes FindMaxTermFromCloudManifestFiles to search with incorrect paths and PurgerEventListener to use wrong bucket names when URL config is active.

Use the effective configuration variables:

  • Replace cloud_config_.bucket_prefix_ with effective_bucket_prefix
  • Replace cloud_config_.bucket_name_ with effective_bucket_name
  • Replace cloud_config_.object_path_ with effective_object_path
🧹 Nitpick comments (1)
eloq_data_store_service/rocksdb_config.h (1)

92-177: Use distinct variable names for improved clarity.

The variable first_slash is declared at line 136 (inside the http/https block) and again at line 152 (after the block). While technically in different scopes, consider renaming to endpoint_slash and bucket_slash respectively for better readability.

Regarding other edge cases: Trailing slashes in URLs like s3://bucket/path/ will result in object_path="path/" with the trailing slash preserved. This appears to be expected behavior. URL-encoded characters (e.g., %20) don't appear to be supported by this parser, which aligns with the codebase design—URL decoding is expected to be handled by the caller before passing URLs to ParseS3Url.
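If trailing slashes should instead be normalized away (as the earlier hardening suggestion for the .cpp parser proposed), a minimal helper would look like this; the name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Strip trailing '/' characters from an object path. An all-slash input
// normalizes to the empty string, which callers should then treat as a
// missing object path and reject.
std::string StripTrailingSlashes(std::string path) {
    while (!path.empty() && path.back() == '/') {
        path.pop_back();
    }
    return path;
}
```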

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4b2f8b and 70ef679.

📒 Files selected for processing (6)
  • eloq_data_store_service/README.md (3 hunks)
  • eloq_data_store_service/RocksDB_Configuration_Flags.md (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (4 hunks)
  • eloq_data_store_service/rocksdb_cloud_dump.cpp (2 hunks)
  • eloq_data_store_service/rocksdb_config.cpp (4 hunks)
  • eloq_data_store_service/rocksdb_config.h (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • eloq_data_store_service/README.md
  • eloq_data_store_service/rocksdb_cloud_dump.cpp
🧰 Additional context used
🧬 Code graph analysis (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)
eloq_data_store_service/rocksdb_config.h (1)
  • ParseS3Url (92-177)
eloq_data_store_service/rocksdb_config.cpp (1)
eloq_data_store_service/main.cpp (2)
  • CheckCommandLineFlagIsDefault (90-102)
  • CheckCommandLineFlagIsDefault (90-90)
🔇 Additional comments (9)
eloq_data_store_service/rocksdb_config.h (2)

74-82: LGTM! Well-structured result type.

The S3UrlComponents struct provides a clear way to return parsed URL components with validation state and error messaging.


197-204: LGTM! Clean integration with existing config structure.

The new s3_url_ member and IsS3UrlConfigured() accessor follow the existing patterns in RocksDBCloudConfig.

eloq_data_store_service/RocksDB_Configuration_Flags.md (1)

90-117: LGTM! Clear documentation with good migration guidance.

The documentation effectively communicates:

  • The new URL-based configuration approach
  • Supported URL formats with examples
  • Deprecation of legacy options
  • Precedence rules when both configurations are present

This will help users understand the new feature and migrate from legacy configuration.

eloq_data_store_service/rocksdb_cloud_data_store.cpp (4)

299-334: LGTM! Clean precedence logic with good validation.

The effective configuration determination correctly:

  • Starts with legacy configuration as defaults
  • Parses and validates the S3 URL when provided
  • Overrides legacy values with URL-derived values (including setting bucket_prefix to empty)
  • Logs the effective configuration for debugging

The error handling with LOG(FATAL) and detailed error messages will help users quickly identify and fix invalid URL formats.


336-343: LGTM! Correct application of effective configuration.

The code properly applies the effective configuration values to both source and destination buckets in cfs_options_. The region remains from cloud_config_.region_ as expected, since it's not part of the URL-based configuration.


378-392: LGTM! Proper endpoint handling with AWS transfer manager logic.

The code correctly:

  • Uses effective_endpoint_url which supports both URL-based and legacy configuration
  • Builds a custom S3 client factory when an endpoint is specified
  • Disables AWS transfer manager for custom endpoints (MinIO compatibility)
  • Enables AWS transfer manager for standard AWS S3

639-645: Inconsistent use of bucket configuration in PurgerEventListener.

Similar to the issue at lines 424-430, this code concatenates cloud_config_.bucket_prefix_ with cloud_config_.bucket_name_ when initializing PurgerEventListener. When URL-based configuration is used:

  • The effective bucket prefix should be empty (per line 321)
  • The effective bucket name comes from the URL

Consider using effective configuration values here as well to maintain consistency with the URL-based configuration paradigm.

eloq_data_store_service/rocksdb_config.cpp (2)

296-303: LGTM! Well-documented flag definition.

The rocksdb_cloud_s3_url flag is well-defined with:

  • Clear format specification
  • Practical examples
  • Explicit precedence documentation
  • Appropriate default value (empty string for opt-in behavior)

519-525: LGTM! Consistent precedence handling.

The S3 URL configuration initialization correctly implements the precedence order (command line > config file > default) and follows the exact pattern used for all other configuration flags in this constructor.
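The precedence order (command line > config file > default) can be illustrated with a minimal stand-in. The maps below are hypothetical placeholders for gflags and the INI reader used in the real code, not the project's actual APIs.

```cpp
#include <map>
#include <string>

// Minimal stand-in for the precedence rule: an explicitly set
// command-line flag wins, then the config-file value, then the
// compiled-in default (an empty default gives opt-in behavior).
static std::string EffectiveValue(
    const std::map<std::string, std::string> &cli_flags,
    const std::map<std::string, std::string> &config_file,
    const std::string &key,
    const std::string &default_value)
{
    auto it = cli_flags.find(key);
    if (it != cli_flags.end())
    {
        return it->second;  // command line takes precedence
    }
    it = config_file.find(key);
    if (it != config_file.end())
    {
        return it->second;  // then the config file
    }
    return default_value;  // finally the flag's default
}
```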

Add s3_url for rocksdb_cloud_dump

Move parse s3_url into rocksdb_cloud_data_store.cpp

Enable verifySSL for S3 endpoint

Move ParseS3Url

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

422-430: Fix inconsistent configuration usage: FindMaxTermFromCloudManifestFiles should use effective config values, not legacy config.

The FindMaxTermFromCloudManifestFiles call on lines 423-430 uses cloud_config_.bucket_prefix_, cloud_config_.bucket_name_, and cloud_config_.object_path_ directly from the original config instead of the computed effective values. This causes incorrect behavior when URL-based configuration is used, as the URL-based config sets effective_bucket_prefix = "" (line 321) while FindMaxTermFromCloudManifestFiles would still use the legacy cloud_config_.bucket_prefix_ value.

Update the call to use effective configuration values for consistency with the cfs_options initialization (lines 336-343):

     bool ret = FindMaxTermFromCloudManifestFiles(storage_provider,
-                                                cloud_config_.bucket_prefix_,
-                                                cloud_config_.bucket_name_,
-                                                cloud_config_.object_path_,
+                                                effective_bucket_prefix,
+                                                effective_bucket_name,
+                                                effective_object_path,
                                                 cloud_config_.branch_name_,
                                                 dss_shard_id,
                                                 cloud_manifest_prefix,
                                                 max_term);

Additionally, lines 639-645 in the event listener initialization also use legacy config directly (bucket_name and object_path). These should similarly be updated to use effective configuration values.

🧹 Nitpick comments (2)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)

299-334: Verify LOG(FATAL) behavior and improve error message.

The URL parsing and override logic is well-implemented with clear logging. However:

  1. LOG(FATAL) terminates the process immediately - this is appropriate for invalid configuration at startup, but ensure this is the intended behavior for your deployment model.

  2. Error message could be more actionable: Consider adding a suggestion about which configuration to check or remove:

         LOG(FATAL) << "Invalid rocksdb_cloud_s3_url: "
                    << url_components.error_message
                    << ". URL format: s3://{bucket}/{path} or "
                       "http(s)://{host}:{port}/{bucket}/{path}. "
                    << "Examples: s3://my-bucket/my-path, "
-                   << "http://localhost:9000/my-bucket/my-path";
+                   << "http://localhost:9000/my-bucket/my-path. "
+                   << "Please correct the rocksdb_cloud_s3_url configuration or remove it to use legacy configuration.";
eloq_data_store_service/rocksdb_cloud_dump.cpp (1)

52-53: Fix inconsistent flag description format.

The bucket_prefix flag description on line 53 should match the format of other legacy flags for consistency:

-DEFINE_string(bucket_prefix, "", "S3 bucket prefix (legacy, not supported with s3_url)");
+DEFINE_string(bucket_prefix, "", "S3 bucket prefix (legacy, use s3_url instead)");

The phrase "not supported with s3_url" might be confusing - it's better to consistently use "use s3_url instead" like the other flags.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70ef679 and 0dd505b.

📒 Files selected for processing (6)
  • eloq_data_store_service/README.md (3 hunks)
  • eloq_data_store_service/RocksDB_Configuration_Flags.md (1 hunks)
  • eloq_data_store_service/rocksdb_cloud_data_store.cpp (4 hunks)
  • eloq_data_store_service/rocksdb_cloud_dump.cpp (2 hunks)
  • eloq_data_store_service/rocksdb_config.cpp (4 hunks)
  • eloq_data_store_service/rocksdb_config.h (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
eloq_data_store_service/rocksdb_config.cpp (2)
eloq_data_store_service/main.cpp (2)
  • CheckCommandLineFlagIsDefault (90-102)
  • CheckCommandLineFlagIsDefault (90-90)
rocksdb_config.cpp (2)
  • CheckCommandLineFlagIsDefault (298-310)
  • CheckCommandLineFlagIsDefault (298-298)
eloq_data_store_service/rocksdb_cloud_dump.cpp (1)
eloq_data_store_service/rocksdb_config.h (1)
  • ParseS3Url (92-177)
eloq_data_store_service/rocksdb_cloud_data_store.cpp (1)
eloq_data_store_service/rocksdb_config.h (1)
  • ParseS3Url (92-177)
🔇 Additional comments (6)
eloq_data_store_service/README.md (1)

199-300: LGTM! Comprehensive migration guide.

The migration guide is well-structured and provides clear examples for different use cases (AWS S3, MinIO, HTTPS S3-compatible storage). The precedence rules and backward compatibility notes are helpful for users migrating from legacy configuration.

eloq_data_store_service/RocksDB_Configuration_Flags.md (1)

90-117: LGTM! Clear documentation of URL-based configuration.

The documentation clearly explains the new URL format, provides examples, and marks legacy configuration as deprecated. The precedence rules are explicit and helpful.

eloq_data_store_service/rocksdb_config.h (1)

74-82: LGTM! Well-designed URL components struct.

The struct provides all necessary fields for storing parsed URL components with proper validation state and error reporting.

eloq_data_store_service/rocksdb_config.cpp (2)

296-303: LGTM! Well-documented flag definition.

The flag definition includes comprehensive description with format examples and precedence notes. This will help users understand how to use the new URL-based configuration.


519-525: LGTM! Correct precedence implementation.

The s3_url_ initialization follows the established pattern of command-line flags taking precedence over configuration file values. This is consistent with the rest of the configuration system.

eloq_data_store_service/rocksdb_cloud_dump.cpp (1)

125-154: LGTM! Consistent URL parsing implementation.

The URL parsing logic in the dump utility correctly mirrors the main data store implementation with appropriate error handling and logging.

Comment on lines +92 to +177
inline S3UrlComponents ParseS3Url(const std::string &s3_url)
{
    S3UrlComponents result;

    if (s3_url.empty())
    {
        result.error_message = "S3 URL is empty";
        return result;
    }

    // Find protocol separator
    size_t protocol_end = s3_url.find("://");
    if (protocol_end == std::string::npos)
    {
        result.error_message = "Invalid S3 URL format: missing '://' separator";
        return result;
    }

    result.protocol = s3_url.substr(0, protocol_end);

    // Validate protocol
    if (result.protocol != "s3" && result.protocol != "http" &&
        result.protocol != "https")
    {
        result.error_message = "Invalid protocol '" + result.protocol +
                               "'. Must be one of: s3, http, https";
        return result;
    }

    // Extract the part after protocol
    size_t path_start = protocol_end + 3;  // Skip "://"
    if (path_start >= s3_url.length())
    {
        result.error_message = "Invalid S3 URL format: no content after protocol";
        return result;
    }

    std::string remaining = s3_url.substr(path_start);

    // For http/https, extract the endpoint (host:port) and then the bucket/path
    // Format: http(s)://{host}:{port}/{bucket_name}/{object_path}
    if (result.protocol == "http" || result.protocol == "https")
    {
        // Find the first slash after the host:port
        size_t first_slash = remaining.find('/');
        if (first_slash == std::string::npos)
        {
            result.error_message =
                "Invalid S3 URL format: missing bucket and object path";
            return result;
        }

        // Store the full endpoint URL including protocol
        result.endpoint_url =
            result.protocol + "://" + remaining.substr(0, first_slash);
        remaining = remaining.substr(first_slash + 1);
    }

    // Now extract bucket_name and object_path from remaining
    // Format: {bucket_name}/{object_path}
    size_t first_slash = remaining.find('/');
    if (first_slash == std::string::npos)
    {
        result.error_message =
            "Invalid S3 URL format: missing object path (format: "
            "{bucket_name}/{object_path})";
        return result;
    }

    result.bucket_name = remaining.substr(0, first_slash);
    result.object_path = remaining.substr(first_slash + 1);

    if (result.bucket_name.empty())
    {
        result.error_message = "Bucket name cannot be empty";
        return result;
    }

    if (result.object_path.empty())
    {
        result.error_message = "Object path cannot be empty";
        return result;
    }

    result.is_valid = true;
    return result;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "rocksdb_config.h" -type f

Repository: eloqdata/store_handler

Length of output: 127


🏁 Script executed:

cat -n ./eloq_data_store_service/rocksdb_config.h | head -200

Repository: eloqdata/store_handler

Length of output: 8065


🏁 Script executed:

rg "ParseS3Url" -A 5 -B 5 --type cpp --type h

Repository: eloqdata/store_handler

Length of output: 3809


🏁 Script executed:

# Check if there are any tests for ParseS3Url
find . -name "*test*.cpp" -o -name "*test*.h" | xargs grep -l "ParseS3Url" 2>/dev/null

Repository: eloqdata/store_handler

Length of output: 50


Add validation for edge cases in URL parsing.

The ParseS3Url function has valid validation logic but misses several edge cases:

  1. Trailing slashes: URLs like s3://bucket/path/ will include the trailing slash in object_path, which may be inconsistent with expected behavior.

  2. Double slashes: Handling is inconsistent—s3://bucket//path is accepted (storing /path as object_path), while http://host:9000//bucket/path is rejected due to empty bucket_name. Consider normalizing or consistently rejecting such URLs.

  3. Port validation: For http/https URLs, the endpoint URL is stored as-is without validating that the port (if present) is numeric. For example, http://host:invalid/bucket/path would be accepted with endpoint_url = "http://host:invalid", potentially causing downstream failures.

  4. URL encoding: The parser does not decode URL-encoded characters (e.g., %20), though this may be acceptable depending on whether URLs are expected to be pre-decoded from configuration sources.

Consider adding explicit validation for numeric ports and documenting expected behavior for URLs with trailing/multiple slashes.
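As one concrete example, the numeric-port check suggested above could look like the sketch below. The helper name and the 1–65535 bounds are assumptions for illustration; this is not part of the patch under review.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: validates that the port (if any) in an endpoint
// like "http://host:9000" is numeric and within [1, 65535]. Returns
// true when no port is present, since the protocol default applies.
static bool HasValidPort(const std::string &endpoint_url)
{
    size_t host_start = endpoint_url.find("://");
    host_start = (host_start == std::string::npos) ? 0 : host_start + 3;

    size_t colon = endpoint_url.find(':', host_start);
    if (colon == std::string::npos)
    {
        return true;  // no explicit port
    }

    std::string port = endpoint_url.substr(colon + 1);
    if (port.empty() || port.size() > 5)
    {
        return false;  // too long to be a valid port number
    }
    for (char c : port)
    {
        if (!std::isdigit(static_cast<unsigned char>(c)))
        {
            return false;  // e.g. "http://host:invalid"
        }
    }
    int value = std::stoi(port);
    return value >= 1 && value <= 65535;
}
```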

